A Statistical Investigation into the Geographic Characteristics of the DATA2X02 Cohort

Author

SID: 520468531

Published

September 10, 2023

1 Introduction

Code
# KNITR MUST BE VERSION 1.42 TO RENDER MAPS

# Library Imports
library('tidyverse')
library('janitor')
library("scales")
library("sf")
library('leaflet')
library('tippy')
library('xfun')
library('ggpubr')
library('flextable')

# Needed to clean names for the inline code during introduction. More involved cleaning will be discussed.
raw_df <- readr::read_csv('Data/DATA2x02 survey (2023) (Responses) - Form responses 1.csv') |>
  janitor::clean_names()

# Adding tool tips for a few key terms
tippy::tippy_this(elementId = "random_sample", 
                  tooltip = "When all members of a population have equal likelihood to be sampled.")
Code
tippy::tippy_this(elementId = "wam", 
                  tooltip = "Weighted Average Mark")

DATA2X02 is a group of two units – DATA2002 and DATA2902 – offered within the School of Mathematics and Statistics at The University of Sydney. The units teach “advanced data analytic skills for a wide range of problems and data” (The University of Sydney 2023) with a focus on statistical methods to analyse and answer a scientific question.

1.1 Survey Method and Random Sampling

The raw dataset provided was sourced from a cohort survey which aimed to gain insight into the units’ cohort. Despite efforts to encourage student participation in the survey through an Ed Discussion Announcement and multiple reminders in labs and lectures, the response rate was 41%. Because of this method of communication, the survey participants may not have been a random sample of DATA2X02 students.

Students who were less engaged – possibly not attending lectures, labs, or interacting with the Ed Discussion Board – are considerably less likely to have completed the survey compared to their counterparts who received multiple prompts. Moreover, those who are more engaged are more likely to take time out of their day to fill out the survey after a reminder. This is evidenced by DATA2902 (the advanced stream of DATA2X02) having a response rate of 71% compared to DATA2002’s rate of 37%. Students could also submit the survey multiple times, which may have skewed the data towards individuals who submitted multiple responses, but exploratory data analysis (EDA) showed that this did not occur in a major way.

Whilst acknowledging these shortcomings of the sampling method and subsequent response pattern, it is assumed that the survey still offers a moderately random sample of the DATA2X02 cohort and that responses were (for the most part) independent of one another. For more detailed analysis, a new dataset should be sourced from a different surveying method, ensuring that more students submit the survey and restricting responses to one per person.

1.2 Sources of Bias

There are some potential biases that may have occurred during this survey.

  • Non-response Bias – As discussed in Section 1.1, there may have been a non-response bias within the survey. Specifically, we see a difference in response rates between DATA2902 and DATA2002 students. This may have skewed the sample data towards the population of DATA2902 students, rather than DATA2X02 as a whole. This would be an issue if there is a major difference between the populations of the two units. This is not out of the question, as those who opt to take an advanced stream of a unit may be more willing to challenge themselves and put more effort into their studies. Moreover, there is the possibility that students do not opt for an advanced unit in order to prioritise other aspects of their lives, such as work.

  • Social desirability/conformity bias – Many of the questions asked in the survey have an associated ‘socially desirable’ response. For example, students may, whether consciously or unconsciously, overestimate the number of hours they exercise, or underestimate the amount of time they spend on social media, as these answers come with positive social connotations. Moreover, students may want to conform to the expected answer of the population. An example of this may be the question of whether or not students had experience in R coding. The majority of the DATA2X02 cohort would have had experience in R as it was taught in many prerequisite courses, so those who didn’t have experience may have answered incorrectly to conform with the rest of the cohort.

  • Recall Bias – Even if students did not suffer from social desirability or conformity bias, they may have simply not been able to recall the correct answer to a question. An example of this would be someone’s WAM. Many students may not know their actual WAM (as it is not reported when getting results or on the online academic transcript), and so they could incorrectly recall it when answering the survey. An instance of this is seen in the WAMs reported, with three students reporting a WAM of 99 or above, a value that could potentially be less accurate due to difficulties in recall or a deliberate distortion.

1.3 Possible Improvements

Several changes to the survey would help generate more useful data. Many of the questions regarding numeric data did not specify the units in which an answer should be given, or whether the units should be included in the answer. This can be addressed by specifying units in the question and only allowing numeric data to be input into the survey rather than free text. One such question was How much sleep do you get (on average, per day)?. A better wording of this question would be How much sleep (in hours per night) do you get on average?. This was also an issue for the question How tall are you?, where answers were not given in a uniform manner. Rewording to How tall are you in cm? would have produced data that required much less cleaning. This extends to What is your shoe size?, where students responded with both US and European shoe sizes, which are on very different scales (a US 10 is a 43 European).

There were also issues regarding the categorical data. The question Would you prefer to study at Fisher Library or SciTech Library? did not need to include an Other response, as any answer of this type would not be answering the question asked. Moreover, the question Do you work? did not align with the suggested responses given. This question should have been What is your current employment status?. A similar issue was seen in the question Do you submit assignments on time?, which should have been How often do you submit assignments on time?. Finally, some questions could have included some options and an Other response, rather than free text. This was a particular issue for What brand is your laptop? and What is your favourite social media platform?, where students gave answers in many different forms when referring to the same category, e.g. Apple and Macbook being the same laptop brand. Providing some pre-defined answers would reduce the need for data cleaning.

1.4 Report Outline

This report will focus on the geographical characteristics of the cohort, with the Postcode of each response being used as a proxy for where a student lives. Specifically, hypothesis testing will be used to determine the impact of a student’s geographical region on a variety of variables.

SA4s are the “largest sub-State regions” and “represent labour markets or groups of labour markets within each State and Territory” (Australian Bureau of Statistics 2021), with each SA4 having approximately 300,000 - 500,000 residents in metropolitan areas. These regions will be used to group together students into the geographical areas with ‘geographical, social and economic similarities’ (Australian Bureau of Statistics 2021). Figure 1 is a map made using Leaflet (Cheng, Karambelkar, and Xie 2023) which showcases the SA4s of Greater Sydney1.

Code
sa4_df <- st_read('Data/1270055001_sa4_2016_aust_shape')
sa4_df_filter <- sa4_df |> filter(GCC_NAME16 == 'Greater Sydney')
Code
p_popup <- paste0("<strong>Name: </strong>", sa4_df_filter$SA4_NAME16)

leaflet(sa4_df_filter) %>%
  addPolygons(
    popup = p_popup,
    fillColor = '#fcbaa2',
    opacity = 1.0,
    weight = 2,
    color = "#d85f33",
    fillOpacity = 0.2) %>%
  addTiles()
Figure 1: Map of SA4s in Greater Sydney (Australian Statistical Geography Standard 2016), (Cheng, Karambelkar, and Xie 2023)

1.5 Data Cleaning

A variety of data cleaning has been done in R (R Core Team 2023) and RStudio (RStudio Team 2020) utilising the tidyverse packages (Wickham et al. 2019). The janitor package (Firke 2023) was initially used to help standardise the names of each column so that a reproducible introduction could be made. A new naming convention for the columns was adopted based on Tarr (2023). Some summary tables have also been created using gt (Iannone et al. 2023).

Column Name Conversion Table
Code
raw_df <- readr::read_csv('Data/DATA2x02 survey (2023) (Responses) - Form responses 1.csv')

old_names <- colnames(raw_df)

df <- raw_df

new_names <- c(
  "timestamp", "n_units", "task_approach", "age",
  "life", "fass_unit", "fass_major", "novel",
  "library", "private_health", "sugar_days", "rent",
  "post_code", "haircut_days", "laptop_brand",
  "urinal_position", "stall_position", "n_weetbix", "food_budget",
  "pineapple", "living_arrangements", "height", "uni_travel_method",
  "feel_anxious", "study_hrs", "work", "social_media",
  "gender", "sleep_time", "diet", "random_number",
  "steak_preference", "dominant_hand", "normal_advanced", "exercise_hrs",
  "employment_hrs", "on_time", "used_r_before", "team_role",
  "social_media_hrs", "uni_year", "sport", "wam", "shoe_size"
)

colnames(df) <- new_names

name_combo <- dplyr::bind_cols(`New Names` = new_names, 
                               `Original Names` = old_names)

name_combo |>
  gt::gt() |>
  gt::tab_header(title = "Column Name Cleaning") |>
  gt::tab_options(heading.title.font.weight = 'bolder', 
                  column_labels.font.weight = 'bold')
Column Name Cleaning
New Names Original Names
timestamp Timestamp
n_units How many units are you enrolled in this semester?
task_approach When it comes to assignments / due tasks do you:
age How old are you?
life Do you tend to lean towards saying "yes" or towards saying "no" to things throughout life?
fass_unit Have you taken one or more units of study from the Faculty of Arts and Social Sciences?
fass_major Are you completing a major or minor in a subject area from the Faculty of Arts and Social Sciences?
novel Have you read a novel this year?
library Would you prefer to study at Fisher Library or SciTech Library?
private_health Do you have private health insurance?
sugar_days How many days in a week you normally consume sweets/chocolates/sugary drinks? (Exclude Diet/Sugar Free Drinks & sweets)?
rent Do you pay rent?
post_code What is your post code?
haircut_days How many days do you go between haircuts (on average)?
laptop_brand What brand is your laptop?
urinal_position You enter a public bathroom and find you're the only one there. There are three urinals on the wall for you to choose from. Which do you choose?
stall_position You enter a public bathroom and there are three stalls to choose from. All three are unoccupied. Which do you choose?
n_weetbix How many Weet-Bix would you typically eat in one sitting?
food_budget What is the average amount of money you spend each week on food/beverages?
pineapple Do you like pineapple on pizza?
living_arrangements What are your current living arrangements?
height How tall are you?
uni_travel_method How do you get to university?
feel_anxious How often would you say you feel anxious on a daily basis?
study_hrs How many hours a week do you spend studying?
work Do you work?
social_media What is your favourite social media platform?
gender What is your gender?
sleep_time How much sleep do you get (on avg, per day)?
diet What is your diet style?
random_number Pick a number at random between 0 and 9
steak_preference How do you like your steak cooked?
dominant_hand What is your dominant hand?
normal_advanced Which unit are you enrolled in?
exercise_hrs On average, how many hours each week do you spend exercising?
employment_hrs How many hours a week (on average) do you work in paid employment?
on_time Do you submit assignments on time?
used_r_before Have you ever used R before starting DATA2x02?
team_role What kind of role (active or passive) do you think you are when working as part of a team?
social_media_hrs How many hours do you spend on social media per day?
uni_year Which year of university are you currently in?
sport Which sports do you play most often?
wam What is your WAM?
shoe_size What is your shoe size?
Figure 2: Column Name Conversion Table (Tarr 2023), (Iannone et al. 2023)

The SA4 name of each respondent was joined to the survey data using a reference table made by Proctor (2023). The HTML Table on the website (Proctor 2023) was converted into a CSV file for easier manipulation (Data Design Group 2023).

Code
sa4_postcode_df <- readr::read_csv('Data/sa4_postcode.csv') |> 
  select(c(`Postcode`, `SA4 Name`)) |> 
  unique() |> 
  filter(!((`Postcode` == 2232) & (`SA4 Name` == 'Southern Highlands and Shoalhaven')))

colnames(sa4_postcode_df) <- c('post_code', 'sa4_name')

sa4_postcode_df$post_code <- as.character(sa4_postcode_df$post_code) 

df$post_code <- as.character(gsub("[^0-9]", "", df$post_code))

# Join on the cleaned postcode; naming the key makes the join explicit
df <- df |> left_join(sa4_postcode_df, by = "post_code")

df |> count(sa4_name) |>
  arrange(desc(n)) |> 
  gt::gt() |> 
  gt::cols_label(sa4_name = "SA4 Name", n='Count of Students') |> 
  gt::tab_header(title = "Count of Students by SA4") |> 
  gt::tab_options(heading.title.font.weight = 'bolder', 
                  column_labels.font.weight = 'bold')
Count of Students by SA4
SA4 Name Count of Students
Sydney - City and Inner South 123
Sydney - Inner West 35
NA 31
Sydney - North Sydney and Hornsby 30
Sydney - Ryde 18
Sydney - Inner South West 16
Sydney - Parramatta 14
Sydney - Northern Beaches 11
Sydney - Eastern Suburbs 9
Sydney - Blacktown 6
Sydney - Outer West and Blue Mountains 6
Sydney - South West 5
Sydney - Baulkham Hills and Hawkesbury 2
Sydney - Outer South West 2
Sydney - Sutherland 2
Central Coast 1
Riverina 1
Figure 3: Students’ SA4s Summary Table (Iannone et al. 2023)
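
The postcode cleaning and join can be illustrated on a small made-up example. This is only a sketch: the responses, the lookup rows and the region names ("Region A" and so on) are invented for illustration, and only the `gsub()` pattern and the `left_join()` call mirror the code above.

```r
library(dplyr)

# Hypothetical free-text postcode responses (not taken from the survey)
responses <- tibble::tibble(post_code = c("2006", " 2042", "NSW 2113", NA))

# Hypothetical lookup table using the same column names as above
lookup <- tibble::tibble(
  post_code = c("2006", "2042", "2113"),
  sa4_name  = c("Region A", "Region B", "Region C")
)

# Strip every non-digit character so the keys match the lookup table
responses$post_code <- gsub("[^0-9]", "", responses$post_code)

# Responses without a usable postcode keep a missing sa4_name
joined <- left_join(responses, lookup, by = "post_code")
joined
```

Responses that cannot be matched keep an `NA` region, which is how the `NA` row in Figure 3 arises.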

The SA4s were further grouped together geographically to collapse some of the groups with lower student counts. Figure 4 is a map of the groupings of SA4s into regions. A conversion table was generated using flextable (Gohel and Skintzos 2023).

SA4 to Region Conversion Table
Code
north_sydney = c('Sydney - North Sydney and Hornsby', 
                 'Sydney - Ryde', 
                 'Sydney - Northern Beaches')
city_and_eastern_suburbs = c('Sydney - City and Inner South', 
                             'Sydney - Eastern Suburbs')
inner_west = c('Sydney - Inner West', 
               'Sydney - Parramatta', 
               'Sydney - Inner South West')

df <- df |> 
  mutate(geographic_regions = case_when(
    sa4_name %in% north_sydney ~ 'North Sydney',
    sa4_name %in% city_and_eastern_suburbs ~ 'City and Eastern Suburbs',
    sa4_name %in% inner_west ~ 'Inner West',
    !is.na(sa4_name) ~ 'Outer South West, Greater Sydney and Regional NSW',
    TRUE ~ NA_character_
  ))

mapping_df <- df |> select(geographic_regions, sa4_name) |> 
  unique() |> 
  drop_na() |> 
  arrange(geographic_regions) |>
  mutate(`Region` = geographic_regions, `SA4 Name`=sa4_name) |> 
  select(Region, `SA4 Name`)

flextable(mapping_df) |> 
  merge_v() |> 
  theme_vanilla() |> 
  width(2, 4) |> 
  width(1, 2)

Region                                               SA4 Name
City and Eastern Suburbs                             Sydney - City and Inner South
                                                     Sydney - Eastern Suburbs
Inner West                                           Sydney - Inner West
                                                     Sydney - Inner South West
                                                     Sydney - Parramatta
North Sydney                                         Sydney - North Sydney and Hornsby
                                                     Sydney - Ryde
                                                     Sydney - Northern Beaches
Outer South West, Greater Sydney and Regional NSW    Sydney - Blacktown
                                                     Sydney - Outer West and Blue Mountains
                                                     Sydney - South West
                                                     Central Coast
                                                     Riverina
                                                     Sydney - Baulkham Hills and Hawkesbury
                                                     Sydney - Sutherland
                                                     Sydney - Outer South West

Code
sa4_df_in_survey <- sa4_df |> filter(SA4_NAME16 %in% df$sa4_name)

sa4_df_in_survey <- sa4_df_in_survey |> 
  mutate(geographic_regions = case_when(
    SA4_NAME16 %in% north_sydney ~ 'North Sydney',
    SA4_NAME16 %in% city_and_eastern_suburbs ~ 'City and Eastern Suburbs',
    SA4_NAME16 %in% inner_west ~ 'Inner West',
    !is.na(SA4_NAME16) ~ 'Outer South West, Greater Sydney and Regional NSW',
    TRUE ~ NA_character_
  ))

factpal <- colorFactor(c('darkgreen', 'darkblue', 'darkred', 'purple'), sa4_df_in_survey$geographic_regions)
p_popup <- paste0("<strong>Name: </strong>", sa4_df_in_survey$SA4_NAME16)

leaflet(sa4_df_in_survey) |> 
  addPolygons(
    popup = p_popup,
    fillColor = ~factpal(geographic_regions),
    opacity = 1.0,
    weight = 2,
    color = ~factpal(geographic_regions),
    fillOpacity = 0.1) |> 
  addTiles() |> 
  addLegend("bottomleft", 
            pal = factpal, 
            values = ~geographic_regions, 
            title='Region')
Figure 4: Map of SA4s grouped into Regions for students in DATA2X02 (Cheng, Karambelkar, and Xie 2023)


A flagging column was made that identified if someone travelled to the university by car.

Code
df <- df |>
  mutate(car_flag = 
           ifelse(str_detect(uni_travel_method, "Car"),
                  "Drive", 
                  ifelse(is.na(uni_travel_method), NA, "Other")))

df |> count(car_flag) |> 
  gt::gt() |> 
  gt::cols_label(car_flag = "Does the Student Drive to University?", 
                 n='Count of Students') |> 
  gt::tab_header(title = "Count of Students by Whether or Not they Travel by Car") |> 
  gt::tab_options(heading.title.font.weight = 'bolder', 
                  column_labels.font.weight = 'bold')
Count of Students by Whether or Not they Travel by Car
Does the Student Drive to University? Count of Students
Drive 66
Other 242
NA 4
Figure 5: Students’ Method of Travel Summary Table (Iannone et al. 2023)


The employment hours of each respondent were binned into categories of \(0\) hours, \(1-10\) hours, and \(11+\) hours. This gives a more quantitative view of a respondent’s employment status: two respondents with the same status may work very different hours, so the bins better capture how much a respondent works. Moreover, bins were used because close to half (45%) of respondents worked no hours, which skewed the means of groups towards \(0\).

Code
bin_ranges <- c(0, 1, 10.5, Inf)
bin_labels <- c("0", "1-10","11+")

# Create a new column with binned values
df$employment_hrs_bin <- cut(df$employment_hrs, 
                             breaks = bin_ranges, 
                             labels = bin_labels, 
                             include.lowest = TRUE)

df |> count(employment_hrs_bin) |> 
  gt::gt() |> 
  gt::cols_label(employment_hrs_bin = "Employment Hours", 
                 n='Count of Students') |> 
  gt::tab_header(title = "Count of Students by Employment Hours") |> 
  gt::tab_options(heading.title.font.weight = 'bolder', 
                  column_labels.font.weight = 'bold')
Count of Students by Employment Hours
Employment Hours Count of Students
0 144
1-10 72
11+ 79
NA 17
Figure 6: Students’ Employment Hours Summary Table (Iannone et al. 2023)
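
It may be worth checking how values that sit exactly on a bin edge are allocated. The sketch below applies the same breaks and labels to a few hypothetical hours values (not survey responses). With `include.lowest = TRUE` and the default `right = TRUE`, the intervals are \([0, 1]\), \((1, 10.5]\) and \((10.5, \infty)\), so a respondent reporting exactly one hour is placed in the "0" bin. The `alternative` breaks are an assumption shown for comparison only, not the method used in this report.

```r
# Hypothetical weekly hours chosen to sit on and around the bin edges
hours <- c(0, 0.5, 1, 2, 10, 10.5, 11, 40)

# Same breaks and labels as above
as_in_report <- cut(hours,
                    breaks = c(0, 1, 10.5, Inf),
                    labels = c("0", "1-10", "11+"),
                    include.lowest = TRUE)

# Alternative (an assumption): only exact zeros fall in the "0" bin,
# because the intervals become (-Inf, 0], (0, 10.5], (10.5, Inf]
alternative <- cut(hours,
                   breaks = c(-Inf, 0, 10.5, Inf),
                   labels = c("0", "1-10", "11+"))

# Compare the two allocations side by side
data.frame(hours, as_in_report, alternative)
```

The two versions differ only for hours strictly between 0 and 1 and for exactly 1 hour, so the counts in Figure 6 would change only if such responses exist.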


Outliers of WAM were set to NA, as these may be international students who have a different WAM system or people who do not know their WAM. Values more than \(3\) standard deviations from the mean were treated as outliers, which is a common method for removing outliers. Figure 7 shows a histogram of respondents’ WAM.

Code
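# Set values more than 3 standard deviations from the mean to NA
remove_outlier <- function(vec){
  centre <- mean(vec, na.rm = TRUE)
  spread <- sd(vec, na.rm = TRUE)
  upper <- centre + 3 * spread
  lower <- centre - 3 * spread
  # which() drops NA comparisons, so existing NAs are left untouched
  vec[which(vec > upper | vec < lower)] <- NA
  return(vec)
}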

df[['wam']] <- remove_outlier(df[['wam']])

df %>%
  ggplot(aes(x=wam)) + 
  geom_histogram(bins = 20, 
                 fill = "#fcbaa2", 
                 color = "#d85f33") + 
  labs(x="WAM", 
       y="Frequency", 
       title="Histogram of Students' WAM with Outliers Removed") + 
  theme(legend.position="none", 
        plot.background = element_rect(fill = "#ffffff", 
                                       linewidth = 0), 
        axis.title = element_text(face="bold"), 
        plot.title = element_text(face="bold", 
                                  size = 13, 
                                  hjust = 0.5))

Figure 7: Histogram of students’ WAM with outliers removed (Wickham 2016)


In all hypothesis tests, the NA values of the columns investigated were removed, as these represented either questions left unanswered by the respondent or values which were deemed outliers.

2 Hypothesis Testing

2.1 Does living in Sydney’s City and Eastern Suburbs influence if students drive to university?

As the University of Sydney is located in Sydney’s City and Eastern Suburbs, it is suspected that students who live close to the university may opt for public transport rather than driving. This is of interest as the effective carbon emissions of the University can be reduced if more students use public transport. This is also suggested by EDA, which showed uneven proportions between the two groups, as seen in Figure 8.

Code
car_df <- df |> 
  select(c(geographic_regions, car_flag)) |> 
  mutate(geographic_regions =  
           ifelse(geographic_regions == 'City and Eastern Suburbs', 
                  'City and Eastern Suburbs', 
                  'Other')) |> 
  drop_na() |> 
  mutate(`Travel Method` = car_flag)

car_df |> ggplot() + 
  aes(x=geographic_regions, fill=`Travel Method`) + 
  geom_bar(colour = "black",
           linewidth = 0.5,
           position = "fill") + 
  labs(y="Proportion of Travel Method",
       x="Region", 
       title="Proportion of Students who drive to University \n based on Geographical Location",
       fill="Travel Method") + 
  theme(plot.background = element_rect(fill = "#ffffff",
                                       linewidth = 0),
        legend.background = element_rect(fill = "#ffffff", 
                                       linewidth = 0),
        panel.border = element_rect(colour = "black", fill=NA),
        legend.box.background = element_rect(colour = "black"),
        axis.title = element_text(face="bold"), 
        plot.title = element_text(face="bold", 
                                  size = 14, 
                                  hjust = 0.5)) + 
  scale_y_continuous(labels = scales::percent) + 
  scale_fill_brewer(palette = "Set2")

Figure 8: Proportion bar chart of travel method for different regions (Wickham 2016)

A \(\chi^2\)-test for independence was performed at the \(\alpha = 0.05\) level on the below contingency table. A Monte-Carlo simulation of size \(6000\) was used to calculate the distribution of test statistics and \(p\) value.

Code
contingency_table <- 
  table(car_df$geographic_regions, car_df$car_flag) |>
  as.data.frame.matrix()

contingency_table$`Region` = c('City and Eastern Suburbs', 'Other')

contingency_table |> gt::gt() |> 
  gt::cols_move_to_start(columns=c(`Region`)) |> 
  gt::tab_spanner(label = "Method of Travel", columns = 1:2) |> 
  gt::tab_header(title = "Count of Students by Method of Travel") |> 
  gt::tab_options(heading.title.font.weight = 'bolder', 
                  column_labels.font.weight = 'bold')
Count of Students by Method of Travel
                            Method of Travel
Region                      Drive    Other
City and Eastern Suburbs       13      118
Other                          48      101
Figure 9: Contingency Table of Region and Method of Travel (Iannone et al. 2023)
Code
set.seed(1)
test <- chisq.test(table(car_df$car_flag, car_df$geographic_regions),
                   simulate.p.value=TRUE, 
                   B=6000)
\(\chi^2\)-test for independence
  1. Hypothesis – \(H_0\): The method of travel of a student is independent of whether or not they live in Sydney’s City and Eastern Suburbs. \(H_1\): There is some interdependence between a student’s method of travel and whether or not they live in Sydney’s City and Eastern Suburbs.

  2. Assumptions – The test statistic is a good measure of independence, and the resampling process provides a good approximation of the distribution of the test statistic. The latter holds because, in the limit of infinitely many resamples, the simulated distribution converges to the null distribution of the test statistic. The observations must also be independent; despite some concerns raised in Section 1.1, the observations are likely to be independent of each other.

  3. Test Statistic – \[T = \sum_{i=1}^2 \sum_{j=1}^2 \frac{\left(Y_{i j}-e_{i j}\right)^2}{e_{i j}}\]

  4. Observed Test Statistic – \(t_0=\) 20.33.

  5. p-value – The proportion of simulated test statistics that were as or more extreme than \(t_0\) was \(p=\) 0.00017.

  6. Decision – As the \(p\)-value was \(<\alpha\), we reject \(H_0\). This implies that there is some interdependence between the method of travel and whether or not a student lives in Sydney’s City and Eastern Suburbs.
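
As a check on the test statistic, \(t_0\) can be recomputed directly from the counts printed in Figure 9. This is a minimal sketch: the object names `observed`, `expected`, `t0` and `mc_test` are introduced here for illustration only. The table is entered in the orientation of Figure 9, so the simulated \(p\)-value may differ slightly from the one reported above.

```r
# Observed counts as printed in the contingency table (Figure 9)
observed <- matrix(c(13, 118,
                     48, 101),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Region = c("City and Eastern Suburbs", "Other"),
                                   Travel = c("Drive", "Other")))

# Expected counts under independence: (row total * column total) / n
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Test statistic: sum over all cells of (observed - expected)^2 / expected
t0 <- sum((observed - expected)^2 / expected)
round(t0, 2)

# Monte Carlo p-value with the same seed and number of simulations
set.seed(1)
mc_test <- chisq.test(observed, simulate.p.value = TRUE, B = 6000)
mc_test$p.value
```

With `B` simulations the smallest attainable \(p\)-value is \(1/(B+1)\), so a very small simulated \(p\)-value indicates that few or none of the simulated tables were as extreme as the observed one.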

2.2 Is academic performance significantly better for students living in North Sydney compared to those in the Inner West?

A student’s WAM is one measure of academic performance. Knowing if WAM is impacted by where students live could be useful, as it could allow the University to provide targeted academic support.

Code
wam_df <- df |> 
  filter(geographic_regions %in% c('Inner West', 'North Sydney')) |> 
  select(geographic_regions, wam) |> 
  drop_na()

inner_west_wam <- filter(wam_df, geographic_regions=="Inner West")$wam

north_sydney_wam <- filter(wam_df, geographic_regions=="North Sydney")$wam

A Welch two-sample one-sided \(t\)-test at the \(\alpha = 0.05\) level was conducted to determine if the mean WAM of students in North Sydney is larger than that of students living in the Inner West. This test is less powerful than a two-sample \(t\)-test, but forgoes the assumption of equal variance. This was done because North Sydney has a variance of 88.8 compared to 101.5 for the Inner West.

Initial EDA suggests this may be the case, with the mean WAM of students being 76.4 in North Sydney and 74.1 in the Inner West. We can also generate a QQ-plot of students’ WAM, which shows the variable is likely to be normally distributed as the points lie close to the reference line.

Code
wam_df |> ggplot() + 
  aes(y=wam, 
      color=geographic_regions, 
      x=geographic_regions, 
      fill=geographic_regions) + 
  geom_boxplot()+ 
  geom_jitter(width=0.05, size=1) +
  scale_color_manual(values = c('darkblue','darkred')) + 
  scale_fill_manual(values = c(rgb(214/255, 214/255, 232/255), 
                               rgb(230/255, 215/255, 214/255))) +
  theme(legend.position="none") + 
  theme(plot.background = element_rect(fill = "#ffffff",
                                       linewidth = 0),
        panel.border = element_rect(colour = "black", fill=NA),
        legend.box.background = element_rect(colour = "black"),
        axis.title = element_text(face="bold"), 
        plot.title = element_text(face="bold", 
                                  size = 14, 
                                  hjust = 0.5)) + 
  labs(y="WAM",
       x="Region", 
       title="A: Grouped Box Plot of WAM by Region")

ggqqplot(wam_df, 
         x = "wam", 
         facet.by = "geographic_regions", 
         color = "geographic_regions", 
         palette=c('darkblue','darkred'), 
         legend='none', 
         title="B: QQ-plot of WAM") + 
  theme(plot.background = element_rect(fill = "#ffffff",
                                       linewidth = 0),
        panel.border = element_rect(colour = "black", 
                                    fill=NA),
        legend.box.background = element_rect(colour = "black"),
        axis.title = element_text(face="bold"), 
        plot.title = element_text(face="bold", 
                                  size = 14, 
                                  hjust = 0.5))

test <- t.test(north_sydney_wam, inner_west_wam, alternative = 'greater')

shapiro1 <- shapiro.test((wam_df |> filter(geographic_regions == 'Inner West'))$wam)
shapiro2 <- shapiro.test((wam_df |> filter(geographic_regions == 'North Sydney'))$wam)

degrees_of_freedom <- test$parameter

Figure 10: Box plot of WAMs of students from the Inner West and North Sydney (Wickham 2016)

Figure 11: QQ-plot of WAMs of students from the Inner West and North Sydney (Kassambara 2023)
Code
wam_df |> ggplot() + 
  aes(color=geographic_regions, x=wam, fill=geographic_regions) + 
  geom_histogram(bins=15) + 
  facet_wrap(~geographic_regions, scales = "free_y") +
  scale_color_manual(values = c('darkblue','darkred')) + 
  scale_fill_manual(values = c(rgb(214/255, 214/255, 232/255), rgb(230/255, 215/255, 214/255))) +
  labs(title="C: Histogram of WAM based on Region",
       x="WAM",
       y="Frequency") +
  theme(plot.background = element_rect(fill = "#ffffff",
                                       linewidth = 0),
        legend.background = element_rect(fill = "#ffffff",
                                         linewidth = 0),
        panel.border = element_rect(colour = "black", fill=NA),
        legend.box.background = element_rect(colour = "black"),
        axis.title = element_text(face="bold", size=16),
        axis.text = element_text(size = 14),
        legend.text = element_text(size = 12),
        plot.title = element_text(face="bold",
                                  size = 18,
                                  hjust = 0.5),
        legend.position = "none")

Figure 12: Histogram of WAMs of students from the Inner West and North Sydney (Wickham 2016)


Welch two-sample one-sided \(t\)-test
  1. Hypothesis – \(H_0\): The mean WAM of students from North Sydney \(\mu_{NS}\) equals the mean WAM of students from the Inner West \(\mu_{IW}\). \(H_1\): \(\mu_{NS}\) is greater than \(\mu_{IW}\).

  2. Assumptions – The observations in each group are independently and identically distributed as \(\mathcal{N}(\mu_{i}, \sigma_{i}^2)\) for \(i=NS, IW\), and the two groups are independent of each other. Despite some concerns raised in Section 1.1, the groups are likely to be independent; the main risk is a student responding twice with different values of WAM or Postcode. The QQ-plot (Figure 11) shows that WAM is likely to be normally distributed. Moreover, using a Shapiro-Wilk test, both groups were consistent with normality, with \(p\)-values of 0.223 for Inner West and 0.273 for North Sydney.

  3. Test Statistic – \[T=\frac{\overline{NS}-\overline{IW}}{\sqrt{\frac{S_{ns}^2}{n_{ns}}+\frac{S_{iw}^2}{n_{iw}}}}\] Here, \(S_{ns}^2\) and \(S_{iw}^2\) are the sample variances of the \(NS\) (North Sydney) and \(IW\) (Inner West) samples. Under \(H_0\), \(T\sim t_{\nu}\) approximately, where \(\nu=\) 104.47 as estimated from the data.

  4. Observed Test Statistic – \(t_0=\) 1.24

  5. p-value – \(p = P\left(t_\nu \geq t_0\right)=\) 0.109

  6. Decision – As the \(p\) -value was \(>\alpha\), we cannot reject \(H_0\) and say the data is consistent with the mean WAM of students in North Sydney and the Inner West being equal.
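
The degrees of freedom \(\nu\) in a Welch test are estimated with the Welch-Satterthwaite approximation. The sketch below shows that calculation on simulated data; the values are generated for illustration and are not the survey responses. The helper name `welch_df` and the simulated groups are assumptions introduced here.

```r
# Welch-Satterthwaite approximation for the degrees of freedom
welch_df <- function(x, y) {
  vx <- var(x) / length(x)  # squared standard error of the mean of x
  vy <- var(y) / length(y)  # squared standard error of the mean of y
  (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))
}

# Simulated WAM-like values (hypothetical, not the survey data)
set.seed(2023)
group_a <- rnorm(50, mean = 76, sd = 9.5)
group_b <- rnorm(55, mean = 74, sd = 10)

# t.test() runs the Welch test by default (var.equal = FALSE)
welch <- t.test(group_a, group_b, alternative = "greater")

# The hand-computed value should agree with the one reported by t.test()
all.equal(welch_df(group_a, group_b), unname(welch$parameter))
```

Because \(\nu\) depends on the sample variances, it is generally not an integer, which is why a value such as 104.47 appears in step 3.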

2.3 Does a student’s Region have a significant influence on how many hours they work?

Initial exploration of the data set suggested that there was a non-uniform distribution of working hours across different regions. The proportion of students working between 1 and 10 hours per week was relatively similar, and the main differences were observed when comparing the proportion of students working no hours or 11 or more hours a week.

Code
employment_df <- df |> 
  select(c(geographic_regions, employment_hrs_bin)) |> 
  drop_na() |> 
  mutate(`Employment Hours per Week` = employment_hrs_bin)

employment_df |> 
  ggplot() + 
  aes(x=geographic_regions, fill=`Employment Hours per Week`) + 
  geom_bar(colour = "black",
           linewidth = 0.5,
           position = "fill") + 
  labs(y="Proportion of Hours Worked Category",
       x="Region", 
       title="Proportion of Students in Hours Worked by Region",
       fill="Employment Hours per Week") + 
  theme(plot.background = element_rect(fill = "#ffffff",
                                       linewidth = 0),
        legend.background = element_rect(fill = "#ffffff", 
                                       linewidth = 0),
        panel.border = element_rect(colour = "black", fill=NA),
        legend.box.background = element_rect(colour = "black"),
        axis.title = element_text(face="bold"), 
        plot.title = element_text(face="bold", 
                                  size = 14, 
                                  hjust = 0.5)) + 
  scale_y_continuous(labels = scales::percent) + 
  scale_x_discrete(labels=c("City and \n Eastern Suburbs", 
                            "Inner West", 
                            "North Sydney", 
                            "Outer South West, \n Greater Sydney and \n Regional NSW")) + 
  scale_fill_brewer(palette = "Set2")

Figure 13: Proportion bar chart of hours worked for different regions (Wickham 2016)

A \(\chi^2\)-test for independence was performed at the \(\alpha = 0.05\) level on the contingency table below. R's chisq.test function applies Yates's correction for continuity only to 2×2 tables, so no correction was applied to this 4×3 table.

Code
contingency_table <- table(employment_df$geographic_regions, employment_df$employment_hrs_bin) |> as.data.frame.matrix()

contingency_table$`Region` = c('City and Eastern Suburbs', 
                               'Inner West', 
                               'North Sydney', 
                               'Outer South West,\n Greater Sydney and \n Regional NSW')

contingency_table |> gt::gt() |> 
  gt::cols_move_to_start(columns=c(`Region`)) |> 
  gt::tab_spanner(label = "Hours Worked", columns = 1:3) |> 
  gt::tab_header(title = "Count of Students by Hours Worked") |> 
  gt::tab_options(heading.title.font.weight = 'bolder', 
                  column_labels.font.weight = 'bold')
Count of Students by Hours Worked

| Region | Hours Worked: 0 | Hours Worked: 1-10 | Hours Worked: 11+ |
|---|---|---|---|
| City and Eastern Suburbs | 82 | 27 | 19 |
| Inner West | 34 | 15 | 14 |
| North Sydney | 15 | 15 | 27 |
| Outer South West, Greater Sydney and Regional NSW | 6 | 8 | 11 |

Figure 14: Contingency Table of Region and Hours Worked (Iannone et al. 2023)
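The counts in Figure 14 can be turned into within-region proportions, the quantity plotted in Figure 13. The sketch below re-enters the counts by hand; the object names `counts` and `row_props` are illustrative and do not appear in the report's own code.

```r
# Counts copied from Figure 14 (rows: regions, columns: hours worked per week)
counts <- matrix(
  c(82, 27, 19,
    34, 15, 14,
    15, 15, 27,
     6,  8, 11),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    Region = c("City and Eastern Suburbs",
               "Inner West",
               "North Sydney",
               "Outer South West, Greater Sydney and Regional NSW"),
    Hours = c("0", "1-10", "11+")
  )
)

# Proportion of each region's respondents in each hours-worked category
row_props <- prop.table(counts, margin = 1)

# Display as percentages rounded to one decimal place
round(100 * row_props, 1)
```

Each row of `row_props` sums to 1, so the rows can be compared directly even though the regions have different numbers of respondents.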
Code
test <- chisq.test(table(employment_df$geographic_regions, employment_df$employment_hrs_bin))

degrees_of_freedom <- test$parameter
\(\chi^2\)-test for independence
  1. Hypothesis – \(H_0\): The number of hours worked by a student is independent of their region. \(H_1\): The number of hours worked by a student is not independent of their region.

  2. Assumptions – The observations are independent, and the expected cell counts are all greater than or equal to 5. Despite the concerns raised in Section 1.1, the observations are likely to be independent of each other, although this cannot be verified with certainty and remains a limitation of the test. None of the expected cell counts were less than 5, so the expected-count assumption holds.

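  3. Test Statistic – \[T = \sum_{i=1}^4 \sum_{j=1}^3 \frac{\left(Y_{i j}-e_{i j}\right)^2}{e_{i j}}\] Here, \(Y_{ij}\) and \(e_{ij}\) are the observed and expected counts for region \(i\) and hours-worked category \(j\). Under \(H_0\), \(T\sim \chi^2_{6}\) approximately.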

  4. Observed Test Statistic – \(t_0 = 35.82\)

  5. p-value – \(p=P(\chi^2_{6} \geq t_0)<0.0001\)

  6. Decision – As the \(p\)-value was \(<\alpha\), we reject \(H_0\). This implies that hours worked in a week and a student’s region are not independent.
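The expected counts, test statistic and degrees of freedom in the steps above can be recomputed directly from the counts in Figure 14. The sketch below re-enters those counts by hand (the object names are illustrative and not taken from the report's code) and compares the manual calculation with `chisq.test`.

```r
# Counts copied from Figure 14 (rows: regions, columns: hours worked per week)
counts <- matrix(
  c(82, 27, 19,
    34, 15, 14,
    15, 15, 27,
     6,  8, 11),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    Region = c("City and Eastern Suburbs",
               "Inner West",
               "North Sydney",
               "Outer South West, Greater Sydney and Regional NSW"),
    Hours = c("0", "1-10", "11+")
  )
)

# Expected counts under independence: (row total * column total) / n
n <- sum(counts)
expected <- outer(rowSums(counts), colSums(counts)) / n

# Pearson chi-squared statistic and its degrees of freedom
t0 <- sum((counts - expected)^2 / expected)
df <- (nrow(counts) - 1) * (ncol(counts) - 1)

# Upper-tail p-value
p_value <- pchisq(t0, df = df, lower.tail = FALSE)

# Cross-check against the built-in test
check <- chisq.test(counts)
```

Here `min(expected)` gives the smallest expected cell count, which is the quantity checked in the assumptions step, and `check$statistic` should match `t0`.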

3 Conclusion

This report investigated the geographic characteristics of the DATA2X02 cohort by grouping students into regions and performing hypothesis tests on several variables.

The analysis found that geographic region played a statistically significant role in the distribution of Travel Method and Employment Hours per Week. Specifically, students' method of travel depended on whether or not they live in Sydney’s City or Eastern Suburbs, and employment hours per week depended on region. There was no statistically significant evidence to suggest that the mean WAM of students from North Sydney is greater than that of students living in the Inner West.

Future investigation into DATA2X02 cohorts may look to validate these results (to see if they are consistent with all DATA2X02 cohorts or just the 2023 cohort), as well as source more specific geographical information about students, rather than using their postcode. An example of this could be using a respondent’s address to find the straight-line distance from their residence to the University, and investigating how this may influence students’ answers. If this were done, stricter data security would be required, as a respondent’s address could be considered too personal to be released to every DATA2X02 student.

References

Australian Bureau of Statistics. 2021. “Statistical Area Level 4.” 2021. https://www.abs.gov.au/statistics/standards/australian-statistical-geography-standard-asgs-edition-3/jul2021-jun2026/main-structure-and-greater-capital-city-statistical-areas/statistical-area-level-4.
Australian Statistical Geography Standard. 2016. “Main Structure and Greater Capital City Statistical Areas.” 2016. https://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/1270.0.55.001July%202016?OpenDocument.
Cheng, Joe, Bhaskar Karambelkar, and Yihui Xie. 2023. Leaflet: Create Interactive Web Maps with the JavaScript ’Leaflet’ Library. https://CRAN.R-project.org/package=leaflet.
Data Design Group, Inc. 2023. “HTML Table to CSV/Excel Converter.” 2023. https://www.convertcsv.com/html-table-to-csv.htm.
Firke, Sam. 2023. Janitor: Simple Tools for Examining and Cleaning Dirty Data. https://CRAN.R-project.org/package=janitor.
Gohel, David, and Panagiotis Skintzos. 2023. Flextable: Functions for Tabular Reporting. https://CRAN.R-project.org/package=flextable.
Iannone, Richard, Joe Cheng, Barret Schloerke, Ellis Hughes, Alexandra Lauer, and JooYoung Seo. 2023. Gt: Easily Create Presentation-Ready Display Tables. https://CRAN.R-project.org/package=gt.
Kassambara, Alboukadel. 2023. Ggpubr: ’Ggplot2’ Based Publication Ready Plots. https://CRAN.R-project.org/package=ggpubr.
Proctor, Matthew. 2023. “Postcodes in New South Wales (NSW).” 2023. https://www.matthewproctor.com/full_australian_postcodes_nsw.
R Core Team. 2023. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
RStudio Team. 2020. RStudio: Integrated Development Environment for R. Boston, MA: RStudio, PBC. http://www.rstudio.com/.
Tarr, Garth. 2023. “DATA2002 Assignment: Data Importing and Cleaning Guide.” 2023. https://pages.github.sydney.edu.au/DATA2002/2023/assignment/assignment_data.html.
The University of Sydney. 2023. “DATA2902.” 2023. https://www.sydney.edu.au/units/DATA2902.
Wickham, Hadley. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.

Footnotes

  1. Shape Files used in this map are available here (Australian Statistical Geography Standard 2016).